Day 10 [Python ML, Feature Engineering] Feature Generation

2021 iThome Ironman Contest

AI & Data

Learning Machine Learning with Python series, part 10

13th Ironman Contest

guancioul

Team: 人工逗點智慧

2021-09-23 22:32:33

Loading the baseline model

%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
from sklearn.preprocessing import LabelEncoder

ks = pd.read_csv('./ks-projects-201801.csv',
                 parse_dates=['deadline', 'launched'])

# Drop live projects
ks = ks.query('state != "live"')

# Add outcome column, "successful" == 1, others are 0
ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))

# Timestamp features
ks = ks.assign(hour=ks.launched.dt.hour,
               day=ks.launched.dt.day,
               month=ks.launched.dt.month,
               year=ks.launched.dt.year)

# Label encoding
cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()
encoded = ks[cat_features].apply(encoder.fit_transform)

data_cols = ['goal', 'hour', 'day', 'month', 'year', 'outcome']
baseline_data = ks[data_cols].join(encoded)

Interactions

We can combine existing features to create new ones.

interactions = ks['category'] + "_" + ks['country']
print(interactions.head(5))

0            Poetry_GB
1    Narrative Film_US
2    Narrative Film_US
3             Music_US
4      Film & Video_US
dtype: object

Next, label-encode the interaction feature with LabelEncoder so that the model can read it.

label_enc = LabelEncoder()
data_interaction = baseline_data.assign(category_country=label_enc.fit_transform(interactions))
data_interaction.head()
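The same idea extends to every pair of categorical columns. Below is a minimal sketch on made-up toy data, not the Kickstarter file. It assumes that pd.factorize(..., sort=True) is an acceptable stand-in for LabelEncoder (both assign integer codes in sorted label order); the frame name toy and its values are hypothetical.

```python
from itertools import combinations

import pandas as pd

# Toy data standing in for the Kickstarter table (values are made up)
toy = pd.DataFrame({
    "category": ["Poetry", "Music", "Music", "Poetry"],
    "currency": ["GBP", "USD", "USD", "USD"],
    "country": ["GB", "US", "US", "US"],
})

interactions = pd.DataFrame(index=toy.index)
for col_a, col_b in combinations(["category", "currency", "country"], 2):
    # Concatenate the two string columns into one combined label
    combined = toy[col_a] + "_" + toy[col_b]
    # Integer-encode the combined label; codes follow sorted label order
    codes, _ = pd.factorize(combined, sort=True)
    interactions[f"{col_a}_{col_b}"] = codes

print(interactions)
```

Each new column can then be joined onto the baseline data in the same way as category_country above.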

Number of projects in the last week

How can we find out how many projects were launched in the week before each project?
We can use the .rolling method.

Create a Series that uses the launched column as its index and the original row index as its values.

launched = pd.Series(ks.index, index=ks.launched, name="count_7_days").sort_index()
launched.head(20)

launched
1970-01-01 01:00:00     94579
1970-01-01 01:00:00    319002
1970-01-01 01:00:00    247913
1970-01-01 01:00:00     48147
1970-01-01 01:00:00     75397
1970-01-01 01:00:00      2842
1970-01-01 01:00:00    273779
2009-04-21 21:02:48    169268
2009-04-23 00:07:53    322000
2009-04-24 21:52:03    138572
2009-04-25 17:36:21    325391
2009-04-27 14:10:39    122662
2009-04-28 13:55:41    213711
2009-04-29 02:04:21    345606
2009-04-29 02:58:50    235255
2009-04-29 04:37:37     98954
2009-04-29 05:26:32    342226
2009-04-29 06:43:44    275091
2009-04-29 13:52:03    284115
2009-04-29 22:08:13     32898
Name: count_7_days, dtype: int64

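When the index is a time series, .rolling lets us choose the size of the sliding window.
For example, launched.rolling('7d') creates a 7-day rolling window.
Each window includes the current project itself, so we subtract 1 from the count.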

count_7_days = launched.rolling('7d').count()-1
print(count_7_days.head(20))

# Ignore records with broken launch dates
plt.plot(count_7_days[7:]);
plt.title("Number of projects launched over periods of 7 days");

launched
1970-01-01 01:00:00     0.0
1970-01-01 01:00:00     1.0
1970-01-01 01:00:00     2.0
1970-01-01 01:00:00     3.0
1970-01-01 01:00:00     4.0
1970-01-01 01:00:00     5.0
1970-01-01 01:00:00     6.0
2009-04-21 21:02:48     0.0
2009-04-23 00:07:53     1.0
2009-04-24 21:52:03     2.0
2009-04-25 17:36:21     3.0
2009-04-27 14:10:39     4.0
2009-04-28 13:55:41     5.0
2009-04-29 02:04:21     5.0
2009-04-29 02:58:50     6.0
2009-04-29 04:37:37     7.0
2009-04-29 05:26:32     8.0
2009-04-29 06:43:44     9.0
2009-04-29 13:52:03    10.0
2009-04-29 22:08:13    11.0
Name: count_7_days, dtype: float64
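To see exactly which rows fall inside each window, here is a small sketch with four hypothetical launch dates. The window is written as "7D", the same 7-day span as '7d' above, and the dates and values are invented for illustration.

```python
import pandas as pd

# Hypothetical launch times; the values play the role of the original row labels
launched = pd.Series(
    [0, 1, 2, 3],
    index=pd.to_datetime(["2021-01-01", "2021-01-03", "2021-01-09", "2021-01-20"]),
    name="count_7_days",
)

# Each window ends at the row's own timestamp and reaches back 7 days.
# The row itself is always inside its window, hence the -1.
counts = launched.rolling("7D").count() - 1
print(counts)
```

The project launched on 2021-01-09 gets a count of 1: the launch on 2021-01-03 is inside its window, while the one on 2021-01-01 is more than 7 days old.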

We now have a Series holding the number of projects launched in the preceding 7 days. Next we add it to the training data.

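# Replace the timestamp index with the original row labels
count_7_days.index = launched.values
# Reorder the counts to match the row order of the original data
count_7_days = count_7_days.reindex(ks.index)
# sort_index also works here, because ks.index is already in ascending order
# count_7_days = count_7_days.sort_index()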

count_7_days.head(10)

0    1487.0
1    2020.0
2     279.0
3     984.0
4     752.0
5     522.0
6     708.0
7    1566.0
8    1048.0
9     975.0
Name: count_7_days, dtype: float64

baseline_data.join(count_7_days).head(10)
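The round trip from row order to time order and back is easy to get wrong, so here is a sketch of the whole sequence on a hypothetical three-row frame whose rows are deliberately not in chronological order. The frame toy and its dates are made up.

```python
import pandas as pd

# Hypothetical frame; row 0 is the latest launch, row 1 the earliest
toy = pd.DataFrame(
    {"launched": pd.to_datetime(["2021-01-09", "2021-01-01", "2021-01-03"])},
    index=[0, 1, 2],
)

# Row labels as values, launch time as index, sorted chronologically
launched = pd.Series(toy.index, index=toy["launched"], name="count_7_days").sort_index()
counts = launched.rolling("7D").count() - 1

# Swap the timestamp index for the row labels, then restore the original row order
counts.index = launched.values
counts = counts.reindex(toy.index)

# The counts now line up with the rows of the original frame
joined = toy.join(counts)
print(joined)
```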

Time since the last project in the same category

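If you are trying to fund a game and another game of the same type has just launched, your project may raise less money.
So for each project we want the time elapsed since the previous launch in the same category.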

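We first group the data with .groupby and then apply .transform.
.transform passes each group to the given function and returns a result with the same index as the original data.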

def time_since_last_project(series):
    # Return the time in hours
    return series.diff().dt.total_seconds() / 3600.

df = ks[['category', 'launched']].sort_values('launched')
df.head(20)

timedeltas = df.groupby('category').transform(time_since_last_project)
timedeltas.head(20)

Fill the NaN values with the median (or mean) of timedeltas, then reindex so that the result can be joined with the other data.

timedeltas = timedeltas.fillna(timedeltas.median()).reindex(baseline_data.index)
timedeltas.head(10)
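A worked example makes it clearer where the NaN values come from. The sketch below uses five invented launches in two hypothetical categories and selects the launched column before calling .transform, which is a small variation on the code above.

```python
import pandas as pd

# Made-up launches in two categories
toy = pd.DataFrame({
    "category": ["Games", "Music", "Games", "Music", "Games"],
    "launched": pd.to_datetime([
        "2021-01-01 00:00", "2021-01-01 06:00", "2021-01-01 12:00",
        "2021-01-02 06:00", "2021-01-03 12:00",
    ]),
})

def time_since_last_project(series):
    # Gap to the previous launch within the same group, in hours
    return series.diff().dt.total_seconds() / 3600.0

df = toy.sort_values("launched")
hours = df.groupby("category")["launched"].transform(time_since_last_project)

# The first project in each category has no predecessor, so its gap is NaN.
# Fill those with the median gap and restore the original row order.
filled = hours.fillna(hours.median()).reindex(toy.index)
print(filled)
```

Here the first Games and first Music projects receive the median gap of 24 hours, while the others keep their real gaps of 12, 24 and 48 hours.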

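Transforming numerical features

The "goal" column shows that many projects have a funding goal below 5,000 USD,
while some projects have goals as high as 100,000 USD.

We therefore transform the numeric values to reduce the influence of these outliers.

Common choices are the square root and the natural logarithm.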

plt.hist(ks.goal, range=(0, 100000), bins=50);
plt.title('Goal');

The histogram shows that values close to 0 are very common, while values close to 100,000 are rare.

plt.hist(np.sqrt(ks.goal), range=(0, 400), bins=50);
plt.title('Sqrt(Goal)');

Taking the square root flattens the distribution somewhat.

plt.hist(np.log(ks.goal), range=(0, 25), bins=50);
plt.title('Log(Goal)');

After the log transform, the values are much more concentrated.

Transformations like these have little effect on tree-based models,
but they can help linear models and neural networks.
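One rough way to quantify what the histograms show is to compare skewness before and after each transformation. This is a sketch on five made-up goal values with a single large outlier, not on the real column, and using pandas' .skew() as the yardstick is an assumption of this example rather than part of the original walkthrough.

```python
import numpy as np
import pandas as pd

# Made-up goal values with one large outlier (not taken from the dataset)
goal = pd.Series([100, 500, 1000, 5000, 100000], dtype=float)

skewness = pd.Series({
    "raw": goal.skew(),
    "sqrt": np.sqrt(goal).skew(),
    "log": np.log(goal).skew(),  # np.log requires strictly positive values
})
print(skewness.round(2))
```

On this toy sample the skewness drops slightly after the square root and much further after the logarithm, which matches the shapes of the Sqrt(Goal) and Log(Goal) histograms.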

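There are many ways to generate features, but they all rely on experience.
Another approach is to generate a large number of features and then use feature selection tools to keep the better ones.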
